{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3610jvsc74a57bd0eb5e09632d6ea1cbf3eb9da7e37b7cf581db5ed13074b21cc44e159dc62acdab",
   "display_name": "Python 3.6.10 64-bit ('dataloader': conda)"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "source": [
    "## DataPipes development tutorial. Loaders DataPipes."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "As DataSet now constructed by stacking `DataPipe`-s it is recommended to keep `DataPipe` functionality as primitive as possible. For example loading data from CSV file will look like sequence of DataPipes: ListFiles FileLoader CSVParser.\n",
    "\n"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "`ExampleListFilesDataPipe` scans all files in `root` folder and yields full file names. Avoid loading entire list in `__init__` function to save memory."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import io\n",
    "import os\n",
    "\n",
    "from torch.utils.data import IterDataPipe, functional_datapipe\n",
    "\n",
    "\n",
    "class ExampleListFilesDataPipe(IterDataPipe):\n",
    "    def __init__(self, *, root):\n",
    "        self.root = root\n",
    "\n",
    "    def __iter__(self):\n",
    "        for (dirpath, dirnames, filenames) in os.walk(self.root):\n",
    "            for file_name in filenames:\n",
    "                yield os.path.join(dirpath, file_name)"
   ]
  },
  {
   "source": [
    "`ExampleFileLoaderDataPipe` registered as `load_files_as_string` consumes file names from source_datapipe and yields file names and file lines."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "@functional_datapipe('load_files_as_string')\n",
    "class ExampleFileLoaderDataPipe(IterDataPipe):\n",
    "    def __init__(self, source_datapipe):\n",
    "        self.source_datapipe = source_datapipe\n",
    "\n",
    "    def __iter__(self):\n",
    "        for file_name in self.source_datapipe:\n",
    "            with open(file_name) as file:\n",
    "                lines = file.read()\n",
    "                yield (file_name, lines)\n"
   ]
  },
  {
   "source": [
    "`ExampleCSVParserDataPipe` registered as `parse_csv_files` consumes file lines and expands them as CSV rows."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "@functional_datapipe('parse_csv_files')\n",
    "class ExampleCSVParserDataPipe(IterDataPipe):\n",
    "    def __init__(self, source_datapipe):\n",
    "        self.source_datapipe = source_datapipe\n",
    "\n",
    "    def __iter__(self):\n",
    "        for file_name, lines in self.source_datapipe:\n",
    "            reader = csv.reader(io.StringIO(lines))\n",
    "            for row in reader:\n",
    "                yield [file_name] + row\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "['/home/vitaly/dataset/data/datapipes/load/iter/test/example_2.csv', '10', \" 'foo'\"]\n['/home/vitaly/dataset/data/datapipes/load/iter/test/example_2.csv', '11', \" 'bar'\"]\n['/home/vitaly/dataset/data/datapipes/load/iter/test/example_1.csv', '12', \" 'aaaa'\"]\n['/home/vitaly/dataset/data/datapipes/load/iter/test/example_1.csv', '13', \" 'bbbb'\"]\n"
     ]
    }
   ],
   "source": [
    "FOLDER = 'define your folder with csv files here'\n",
    "FOLDER = '/home/vitaly/dataset/data'\n",
    "dp = ExampleListFilesDataPipe(root = FOLDER).filter(lambda filename: filename.endswith('.csv')).load_files_as_string().parse_csv_files()\n",
    "\n",
    "for data in dp:\n",
    "    print(data)"
   ]
  },
  {
   "source": [
    "This approach allows to replace any DataPipe to get different functionality. For example you can pick individual files.\n"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "['/home/vitaly/dataset/data/datapipes/load/iter/test/example_1.csv', '12', \" 'aaaa'\"]\n['/home/vitaly/dataset/data/datapipes/load/iter/test/example_1.csv', '13', \" 'bbbb'\"]\n"
     ]
    }
   ],
   "source": [
    "FILE = 'define your file with csv data here'\n",
    "FILE = '/home/vitaly/dataset/data/datapipes/load/iter/test/example_1.csv'\n",
    "dp = ExampleFileLoaderDataPipe([FILE]).parse_csv_files()\n",
    "\n",
    "for data in dp:\n",
    "    print(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ]
}